SUMMARY PAGE


THE  PROBLEM 

Human performance testing in unusual environments such as ship
motion and vibration almost always involves repeated testing of the same
individuals. The purpose of the Performance Evaluation Tests for Environmental
Research (PETER) program was to standardize a test battery for use
in repeated measures experiments.


FINDINGS 


The  Performance  Evaluation  Tests  for  Environmental  Research  (PETER) 
program  was  begun  at  NBDL  in  1977.  This  report  includes  four  papers  which 
were  written  between  1977  and  1980  describing  progress  and  developments  in 
this program. "An Engineering Approach to the Standardization of Performance
Evaluation Tests for Environmental Research (PETER)" delineates the structure
of the PETER paradigm, describes representative results and discusses implications
of the results to previous and future research. "Assessing Productivity
and Well-Being in Navy Workplaces" explains how Jones' rate-terminal
theory of skill acquisition has been applied to the study of complex human
performance and abilities. Examples from two tests administered under a
fifteen day repeated measures paradigm are presented to illustrate the methodological
approach employed in the PETER program. Application of these methods
to selection and training research is suggested. "Progress in the Analysis
of a Performance Evaluation Test for Environmental Research (PETER)" describes
the preliminary results of ten tests which had been completed by October 1978.
"The Development of a Navy Performance Evaluation Test for Environmental
Research (PETER)" describes the earliest plan for developing PETER as it was
formulated in 1977. It describes the philosophy and principles upon which the
PETER program was based.


RECOMMENDATIONS 

It  is  recommended  that  only  stable  and  reliable  tests  be  used  in 
repeated  measures  experiments. 
ABSTRACT


Many investigators have documented the problems of measuring performance in unusual
environments. Reliable, valid, and standardized test batteries for repeated administrations have
not been previously developed. This paper describes progress in developing such a battery:
Performance Evaluation Tests for Environmental Research (PETER). In this program, the stability
and sensitivity of performance tasks are studied over repeated sessions (15 days). The
approach has been to test, at the same time of day, the same group of 20 healthy subjects in
order to provide baselines and expected values. Thus far, 48 cognitive, perceptual and psychomotor
tasks, mainly from the research literature, have been partially or completely evaluated.
Subjecting these tasks to protracted practice reveals the following: (1) most task performances
do not asymptote, (2) most standard deviations are either homogeneous or become
regular, and (3) more importantly, changes in reliabilities occur which cannot be anticipated
from the means and standard deviations. The last finding has not been commented upon before in
this context. Based on these findings, it is believed that most previous environmental studies
which employed a repeated measures paradigm should be seriously questioned or critically
re-examined.


INTRODUCTION 


An "engineering approach" to the development and standardization of the Performance Evaluation
Tests for Environmental Research (PETER) battery has been previously proposed (Kennedy &
Bittner, 1977). This engineering approach is directed at the test and evaluation (T&E) of
performance tasks prior to their being employed for assessment of environmental effects. This
T&E of performance tasks is similar to that which an engineer conducts to assess the stability
of an instrument prior to its utilization. The goal of the PETER program is to study the
possibly adverse effects of ship motion on performance. However, because PETER is being designed
for repeated administrations, it will be directly applicable to studies in other environments
and treatments (e.g., hyperbaric, thermal, drug).


TABLE 1

CATEGORIES IN THE STUDY OF ENVIRONMENTAL STRESS ON HUMANS

ADVERSE EFFECT CATEGORIES     DEFINITION

1. HEALTH AND SAFETY:         It exceeds medical limits adequate
                              for safety and health.

2. COMFORT:                   It is unpleasant, causes discomfort.

3. I/O QUALITY:               A physical aspect of the environment
                              interacts to modify the input/output
                              quality of stimulus or response.

4. CNS PROBLEMS:              It occasions major, identifiable
                              changes in central nervous system
                              (or "throughput") functioning.


Initially, PETER is being aimed at assessing central nervous system (CNS) functioning. CNS
Problems, as seen in Table 1, can be contrasted with other categories of "performance" decrements
including: Health and Safety, Comfort, and Input/Output (I/O) Quality. Examples for each of
these four categories appear in Table 2 for inertial environments and in Table 3 for hyperbaric
environments. All of these categories are of concern to the individual who has the responsibility for
managing human effectiveness in a civilian or military setting, but each category implies a
different type of performance degradation. The scientific and military literature has rarely
distinguished between these categories. However, it is evident from inspection of Tables 1, 2,
and 3 that specifying a category can imply the research strategy necessary for further study.
Although present focus is on CNS Problems, future work will include the study of Comfort and
I/O Quality Problems.

The  purpose  of  this  report  is  to  delineate  the  structure  of  the  PETER  paradigm,  describe 
representative  results  of  the  application,  and  discuss  implications  of  the  results  to  previous 
and  future  research. 


TABLE 2

SHIP MOTION

ADVERSE EFFECT CATEGORIES     ILLUSTRATIVE PROBLEMS

1. HEALTH AND SAFETY:         Vomiting results in dehydration and
                              accompanying problems.

2. COMFORT:                   Nausea

3. I/O QUALITY:               Input: Movement of the platform may jiggle
                              the image presented to the retina.
                              Output: Body sway decreases limb steadiness.

4. CNS PROBLEMS:              Idiopathic: Soporific effects of motion
                              Nonidiopathic: Estimates of the rate of passage
                              of time have greater error during motion.

TABLE 3

HYPERBARIA

ADVERSE EFFECT CATEGORIES     ILLUSTRATIVE PROBLEMS

1. HEALTH AND SAFETY:         Aseptic necrosis

2. COMFORT:                   Joint pain

3. I/O QUALITY:               Input: Chamber noise
                              Output: Limb tremor

4. CNS PROBLEMS:              Idiopathic: High Pressure Nervous System Syndrome
                              Nonidiopathic: Narcosis

THE PETER PARADIGM


Method 

Task Selection. The strategy in PETER has been to consider tasks which purport to assess
mental work. Initially, tasks which meet one or more of the following criteria are being
selected for test and evaluation: (1) task performance has been reported to be disrupted in a
thermal, or inertial, or hyperbaric environment; (2) there is a concurrence in the scientific
literature that some element of cognition, information processing, memory, etc., is being assessed
by the task; or (3) the task distinguishes normal from brain damaged populations. This strategy is
directed at obtaining a comprehensive selection of old and new tasks. In future studies, more
real world oriented tasks will be examined along with these laboratory tasks.

Subjects. Twenty full time research subjects form the experimental population. These men
are fit, average or above in intelligence, motivated to perform, and under constant military
supervision and daily medical assessment (Thomas, Majewski, Ewing & Gilbert, 1977). All volunteer
subjects were recruited and evaluated in accordance with procedures specified in Secretary
of the Navy Instruction 3900.39 and Bureau of Medicine Instruction 3900.6. The instructions
require voluntary informed consent and meet prevailing national and international guidelines.

Analysis. The test and evaluation plan is to obtain descriptive statistics for each test
as it is performed for 15 workday mornings (8 - 10 AM). Analyses of means, standard deviations
and correlations are used in the evaluation of tasks. Means over days and across subjects are
analyzed to see whether they meet any of three criteria for mean stability: (1) plateau, or
level across trials, (2) asymptotic, or approach to unchanging values after some point in
training, or (3) slow, approximately linear increase after some number of trials. Further,
standard deviations across subjects are examined to see whether they are "stable" (i.e., constant)
after some point in training. Lastly, cross trial reliabilities are studied to see
whether they are "differentially stable", that is, have constant correlations with subsequent
trials after some point in training (Jones, 1969, 1972). If criteria for the stability of the
means, standard deviations and correlations are met, then a task can be recommended for tentative
inclusion in the PETER battery. Ultimate inclusion in PETER will depend on factorial
uniqueness and validity analyses which will be conducted in later stages of PETER development.
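The three descriptive analyses named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the PETER program: the function names, the subjects-by-days array layout, and the 0.05 tolerance used to call a base-day profile "constant" are all hypothetical choices made for the example.

```python
import numpy as np

def stability_summary(scores):
    """Daily means, between-subject SDs and cross-day correlations.

    scores: array of shape (n_subjects, n_days), one score per subject per day
    (an assumed layout, not one specified in the report).
    """
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=0)               # group mean for each day
    sds = scores.std(axis=0, ddof=1)          # SD across subjects for each day
    corr = np.corrcoef(scores, rowvar=False)  # day-by-day reliability matrix
    return means, sds, corr

def base_day_profile(corr, base_day):
    """Correlations of one base day (1-indexed) with every later day."""
    i = base_day - 1
    return np.asarray(corr, dtype=float)[i, i + 1:]

def is_differentially_stable(corr, base_day, tolerance=0.05):
    """True when the base day correlates about equally with all later days.

    The tolerance is a hypothetical cutoff chosen for this sketch only.
    """
    profile = base_day_profile(corr, base_day)
    if profile.size < 2:
        raise ValueError("need at least two later days to judge constancy")
    return bool(profile.max() - profile.min() <= tolerance)

if __name__ == "__main__":
    # Synthetic data: 20 subjects, 15 days, stable individual differences.
    rng = np.random.default_rng(0)
    ability = rng.normal(50.0, 10.0, size=(20, 1))
    scores = ability + rng.normal(0.0, 1.0, size=(20, 15))
    means, sds, corr = stability_summary(scores)
    print(is_differentially_stable(corr, base_day=6))
```

A flat profile for a base day, whatever its level, is what the text calls differential stability; the level itself is treated separately as task definition.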

Rationale 

The T&E approach described above was motivated by the pre, per, post (PPP) paradigm
typically employed in environmental assessment research. The PPP paradigm assesses subjects
for a number of trials: pre-exposure; during or per-exposure; and post-exposure. In these
studies, small numbers of subjects, frequently fewer than six, are generally employed and simple
repeated measures ANOVA are used to analyze the results. The PPP paradigm has many variants
(e.g., addition of a nonexposure control group). However, whatever the variant, the PPP paradigm
has stringent requirements which must be met before results can be analyzed and interpreted
meaningfully.

The criteria for mean, standard deviation, and correlation stability which are delineated
above must be met if the PPP paradigm is to be employed. In particular, changes in means over
trials, other than slow, linear changes, can hide change due to an environment. For example,
if means are changing over sessions when an environmental condition is encountered, it may not
be possible to determine whether it was the overall level of performance which was disrupted or
the learning. In addition, failure to meet either the standard deviation or reliability correlation
requirements is equivalent to violating the compound symmetry assumptions of simple repeated measures
ANOVA (Winer, 1972). Multivariate analysis methods might appear to offer an alternative to the
simple ANOVA; however, these methods require substantially more subjects than trials (cf.
Morrison, 1967). Additionally, the changing nature of what is being measured is signalled by
differentially unstable reliability correlations (cf. Alvares & Hulin, 1972), which in turn
makes attribution of effect difficult if not impossible (Bittner, 1979). Obviously, short of a
major paradigm shift, the stability criteria specified above must be met.

Differential stability of the reliability correlations is not the only feature to look at
in evaluation of tasks for PETER. "Task definition" (Jones, 1979), the absolute magnitude of
the reliability (r) after stabilization, is also considered. Unless task definition is substantial,
sensitivity to differences between conditions may be poor. This may be seen on
examination of Equation (1), which compares control and experimental condition means M_c and M_e,
where the respective standard deviations are SD_c and SD_e, and where r_{ce} is the inter-trial
correlation. With equal standard deviations, the standard error in (1) may be seen to approach
zero as the retest reliability approaches r = 1.00. Conversely, the absence of reliability
(r = 0) implies that the size of the denominator of (1) will be equivalent to the use of
independent groups. Indeed, when the reliability is low (r < .40), a few more subjects in each
of two independent samples will result in more precision of the error term than is derived by
repeated measures on the same subjects. Caution should be employed in examinations of task
definition and care should be taken to consider the time required to obtain a particular task
datum. The Spearman-Brown adjustment (Allen & Yen, 1972, p.{79) and similar approaches imply
that increased data sampling will increase reliability; hence, task reliability can be improved
by increasing data collection time. Notwithstanding, task definition is employed in PETER, but
this must be tempered by consideration of the time required for taking measurements.

t = (M_c - M_e) / \sqrt{(SD_c^2 + SD_e^2 - 2 r_{ce} SD_c SD_e) / N}          (1)
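The effect of reliability on Equation (1) can be illustrated numerically. The sketch below is an illustration only: the function names are hypothetical, N is assumed to be the number of subjects measured under both conditions, and the Spearman-Brown function uses the standard textbook form of that adjustment, which the report cites but does not write out.

```python
import math

def repeated_measures_t(mean_c, mean_e, sd_c, sd_e, r_ce, n):
    """Equation (1): t for control vs. experimental means on the same subjects.

    n is assumed here to be the number of subjects. With r_ce = 1 and equal
    SDs the denominator is zero, so that limiting case is not handled.
    """
    variance_of_difference = sd_c ** 2 + sd_e ** 2 - 2.0 * r_ce * sd_c * sd_e
    standard_error = math.sqrt(variance_of_difference / n)
    return (mean_c - mean_e) / standard_error

def spearman_brown(r, k):
    """Predicted reliability when the amount of data per score is multiplied by k."""
    return k * r / (1.0 + (k - 1.0) * r)

if __name__ == "__main__":
    # Same one-point mean difference, same SDs, same subjects; only r changes.
    for r in (0.0, 0.40, 0.75, 0.95):
        t = repeated_measures_t(10.0, 9.0, 2.0, 2.0, r, 16)
        print(f"r = {r:.2f}  t = {t:.2f}")
    # Doubling data collection time for a task with r = .50.
    print(f"{spearman_brown(0.50, 2):.2f}")
```

With r = 0 the error term equals that of two independent groups of the same size, and it shrinks as r rises, which is the sense in which high task definition buys sensitivity.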
RESULTS


Overview 

Thus far, 48 tasks have been studied. Thirty of these have been completely analyzed and
copies of the data can be obtained on request. The remainder are in various stages of completion
with preprints available for 10 of them. The studied tasks tap functions from many
areas of the human performance spectrum and have been drawn from a number of collections of
tasks including: Rose (1974); Ekstrom, French, Harman & Dermen (1976); Wechsler (1955) and
others. It is suspected that as many as 200 total tests will eventually need to be evaluated
in this way, but a preliminary battery could be selected now on the basis of available findings.
Results, reported below, will center around two tasks which are representative of those studied
thus far.

Representative  Tasks 

Air Combat Maneuvering. Figure 1 shows the means and standard deviations for the Air
Combat Maneuvering (ACM) task (Jones, Kennedy & Bittner, in preparation) over 15 days. The
means increase steadily through Day 14 with the increase being greatest during the first four
days. Days 14 and 15 are the same. The standard deviations increase slightly through Day 5
and then remain constant through Day 15. Figure 2 is constructed from Table 4. Although
correlations throughout the matrix are high (r > .70), the earlier days (1, 2, & 4) are lower and
more variable than later days. Base Day 6, 10 & 12 correlations are over .90 and remain constant
with those following, indicating differential stability.

Time Estimation. The means and standard deviations for the Time Estimation Test (McCauley,
Kennedy, & Bittner, 1979) are shown in Figure 3. Both means and standard deviations appear
approximately level throughout the experiment, with the standard deviation covarying with the
small fluctuations of the mean. Figure 4, which was constructed from Table 5, shows the reliabilities
of selected base days and those following for the Time Estimation Test. Although the
reliabilities between adjacent days appear satisfactory, the reliabilities for Base Days 1
through 11 tend to decrease as a function of increasing days of separation, with correlations
for the earlier days falling off more quickly and more dramatically. Correlations for Base Day
12 and those following were high (r = .85) and a relatively shallow decrease in correlations
with following days is seen.
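The contrast between the two tasks can be made concrete by summarizing a few base-day rows. The sketch below is an illustration only: the rows are transcribed from Tables 4 and 5 (ACM Base Days 1 and 6, Time Estimation Base Days 9 and 12), while the summary function and the choice of mean and range as summary statistics are assumptions of the example, not statistics reported by the authors.

```python
from statistics import mean

# Correlations of each base day with all following days, from Tables 4 and 5.
PROFILES = {
    "ACM, Base Day 1": [.85, .77, .73, .88, .82, .79, .81,
                        .77, .73, .72, .81, .76, .73, .77],
    "ACM, Base Day 6": [.93, .97, .98, .92, .91, .94, .94, .94, .95],
    "Time Estimation, Base Day 9": [.84, .73, .72, .54, .62, .46],
    "Time Estimation, Base Day 12": [.88, .84, .83],
}

def summarize(profile):
    """Mean level (task definition) and spread (constancy) of one profile."""
    return {
        "mean": round(mean(profile), 2),
        "range": round(max(profile) - min(profile), 2),
    }

if __name__ == "__main__":
    for label, profile in PROFILES.items():
        print(label, summarize(profile))
```

A small range with a high mean is the pattern described for the later ACM base days; a large range, as for Time Estimation Base Day 9, reflects correlations that fall off with increasing separation.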


TABLE 4

Air Combat Maneuvering Task (ATARI 1): Reliabilities Over 15 Days (N = 13)

Days    2    3    4    5    6    7    8    9   10   11   12   13   14   15

   1  .85  .77  .73  .88  .82  .79  .81  .77  .73  .72  .81  .76  .73  .77
   2       .92  .87  .84  .83  .73  .82  .76  .70  .73  .77  .76  .74  .74
   3            .90  .88  .84  .73  .80  .81  .70  .81  .73  .79  .74  .78
   4                 .88  .88  .84  .87  .86  .82  .91  .85  .89  .86  .86
   5                      .95  .91  .95  .94  .90  .93  .91  .93  .89  .92
   6                           .93  .97  .98  .92  .91  .94  .94  .94  .95
   7                                .97  .92  .93  .94  .96  .94  .93  .96
   8                                     .95  .95  .93  .97  .94  .94  .96
   9                                          .92  .94  .93  .94  .94  .94
  10                                               .93  .98  .94  .93  .94
  11                                                    .93  .96  .94  .95
  12                                                         .95  .95  .96
  13                                                              .98  .98
  14                                                                   .97

TABLE 5

Time Estimation: Constant Error (CE) Reliabilities Over 15 Days (n = 19)

Days    2    3    4    5    6    7    8    9   10   11   12   13   14   15

   1 .80*  .40 -.14  .08 -.04  .16  .08  .03 -.12 -.19 -.05 -.21 -.26 -.24
   2       .59  .22  .34  .28  .44  .40  .30  .14  .07  .16 -.05 -.02 -.07
   3            .67  .73  .49  .54  .37  .20  .09  .12  .16  .12  .06  .03
   4                 .70  .69  .65  .53  .38  .28  .25  .27  .28  .19  .12
   5                      .80  .65  .62  .55  .38  .32  .42  .37  .36  .28
   6                           .83  .87  .82  .63  .57  .57  .52  .55  .37
   7                                .79  .70  .61  .53  .61  .53  .46  .39
   8                                     .94  .80  .75  .72  .57  .66  .47
   9                                          .84  .73  .72  .54  .62  .46
  10                                               .76  .75  .61  .70  .54
  11                                                    .90  .82  .78  .78
  12                                                         .88  .84  .83
  13                                                              .89  .96
  14                                                                   .90


Figure 3. Time Estimation CE score
means (x) and standard deviations (SD)
over 15 days (n = 19).


Figure 4. Time Estimation CE score
reliabilities between selected base
days (1, 2, 4, 8, 10 & 12) and those
following over 15 days (n = 18).


DISCUSSION 


Mean performance improves over the 15 days of an experiment for nearly all tasks studied
by the PETER paradigm. Figure 1, for the ACM task from the Atari series
of video games, is characteristic of what we find routinely, viz., a learning curve. Of note
is that this test represents 30 minutes/day for three weeks, a lot of practice. Contrast this
function with Time Estimation (Figure 3), where no learning curve is apparent, suggestive of a
more desirable test from the standpoint of mean stability. In addition, a comparison of the
standard deviations of both tasks shows, if anything, greater stability for the Time Estimation
task. However, comparison of Figures 2 and 4, which contain traces of correlation coefficients
for these tasks, tells a radically different story. Differential stability of the ACM task is
obtained early and is of substantially greater magnitude than that of the marginally stable Time
Estimation test. These two tests underscore the importance of the reliability, a neglected
statistic in performance testing in adverse environments.

Not all tests behave similarly, and all combinations of mean and standard deviation
changes can occur with or without stabilized correlations. In addition, results have shown
that less than half of the tests which have been so studied (Jones, 1979) meet the criteria of
stabilized reliability correlations. Of the ten tasks which have been reported, six tasks
stabilize quickly and have acceptable task definition: Code Substitution (Wechsler, 1958), ACM
from the Atari video game system (Jones, Kennedy & Bittner, in preparation), Grammatical
Reasoning (Rose, 1974), Arithmetic (Seales, Kennedy & Bittner, 1979), Stroop Color-Words
(Harbeson, Kennedy & Bittner, 1979), and Two-Dimensional Tracking (Damos, 1979). Critical
Tracking (Damos, Kennedy & Bittner, 1979) also stabilizes with acceptable task definition, but
findings are less clear cut. Arithmetic is best in magnitude and quickness of correlational
stability and ACM is next best. Four tasks, Complex Counting (Kennedy & Bittner, 1979), Time
Estimation, Letter Search, and the Spoke Trail-Making Test (Kennedy & Bittner, 1978), either do
not stabilize or, if they do, have unacceptably low task definition.

In conclusion, half of the tests we have studied lack differential stabilization as revealed
by examining the correlations. Given that this result occurs in tasks which may have
stable means and standard deviations and were largely drawn from established batteries, it
might be conjectured that of all previous investigations in adverse environments, many may have
been conducted employing unstable tasks. Differential stability, as discussed earlier, is
required for valid and meaningful analysis. It is believed that when the results of environmental
studies have been based on tasks not shown as differentially stable or employing independent
groups designs, these studies should be seriously questioned or critically re-examined.
ABSTRACT

When individuals are required to work in arduous environments, such as may be encountered aboard ship,
productivity and well-being can be reduced. The Performance Evaluation Tests for Environmental Research (PETER)
battery is being designed to monitor the effects of such unusual environments. In the PETER program, Jones'
rate-terminal theory of skill acquisition is being applied to the study of complex human performance and abilities.
This model was originally derived from studies of motor skill acquisition and permits isolation of performance
into two elements, one relating to the acquisition stage of training and the other to the capacity of the
individual. The reliability of most test batteries has been determined over only two or three administrations,
which assumes that stable (unchanging) abilities are being measured. Task performance, however, generally
changes with practice. Unless, therefore, a task has been practiced until between-subject differences cease to
change, it cannot be used reliably to measure environmental (or any other) effects. During the early trials on
a test, subjects improve at different rates and, after extended practice, arrive at different terminal levels of
skill. If subjects are tested for environmental effects during the acquisition phase, it is not possible to
tell whether differences in performance are due to individual differences or to differences in exposure and
transfer. It is only when a test is stable, that is, when mean performance levels off and the rank order of
subjects ceases to change, that a test can measure environmental effects. Findings from the sixty tests which
have been administered in a fifteen day repeated-measures paradigm support the rate-terminal theory of skill
acquisition. Examples from two of these tests are presented to illustrate the methodological approach we employ
for the study of complex mental functions. The application of these methods to selection and training research
is suggested, and the critical re-examination or reinterpretation of human performance studies which have not
taken repeated measures problems into consideration is recommended.


INTRODUCTION 

Environmental  stressors  which  are  experienced  in 
Navy  workplaces,  such  as  aboard  ship,  may  reduce 
well-being  and  productivity.  The  gross  effects  of 
such  arduous  environments  are  readily  observable,  but 
in  order  to  detect  subtle  effects  a  sensitive  measuring 
instrument  is  necessary.  Such  a  testing  device  could 
be used to predict the onset of decrements in performance,
to select resistant personnel or to explore the
possibility  of  training  people  to  become  more  resistant. 
The  Performance  Evaluation  Tests  for  Environmental 
Research  (PETER)  battery,  which  is  being  developed 
primarily  to  study  ship  motion,  is  being  designed  to 
be  sensitive  to  subtle  changes  in  performance.  It  is 
our  opinion  that  this  type  of  sensitivity  has  not  been 
achieved  in  past  human  performance  studies  because 
adequate  attention  has  not  been  given  to  the  effects 
of  practice. 

Several years ago, Jones (1970a, 1970b) proposed
a two process theory to describe the acquisition of
motor skills. The theory posited an acquisition
phase, in which persons improve at different rates, and
a terminal phase, in which persons reach or approximate
their individual limits. The theory therefore specifies
(and experimental data support) that different
persons begin at different points initially and arrive
at different final values via different pathways. The
theory further implies that, to the extent that the
terminal process is reached, persons will cease to
change positions relative to each other despite additional
practice. In other words, several individuals
may approach a task with differing experience levels
and capacities, both of which influence their initial
scores. As practice continues, previous experience
will begin to contribute proportionately less to a
person's score, and individual differences in learning,
or the readiness with which a person acquires his best
performance, begin to influence his test score more.

As the amount of experimental time increases proportional
to previous practice, and as learning progresses,
differences between subjects will become more attributable
to actual differences in underlying ability,
or capacity, until finally the amount of ability is
largely what governs performance scores. Thus, an
inter-session correlation matrix would present a
distinctively different appearance if performance
early versus late in practice was examined. Early in
practice one would ordinarily observe the superdiagonal
form (Jones, 1969), in which correlations between
adjacent trials would be higher than comparisons which
are more remote. Secondly, correlations of immediately
adjacent trials (e.g., 1,2; 2,3;...) would be higher
later (e.g., trials 10,11) rather than earlier (e.g.,
trials 2,3) in practice. Late in practice, if the
theory holds, the correlation coefficients would
become constant if the terminal process is reached, so
that no systematic differences would be present in the
matrix as a function of temporal separation. If the
terminal process is not reached, then the matrix will
continue to show superdiagonal form (Jones, 1969).
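One simple way to look for superdiagonal form is to average the correlations at each temporal separation (lag) and see whether the averages fall as the lag grows. The sketch below is an illustration under stated assumptions, not a method taken from Jones or from the PETER reports: the function names are hypothetical, and the idealized matrix with entries rho^|i-j| is used only as a convenient stand-in for an acquisition-phase pattern.

```python
import numpy as np

def mean_correlation_by_lag(corr):
    """Average correlation at each separation: lag 1, lag 2, ..., lag n-1."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    return [float(np.diagonal(corr, offset=lag).mean()) for lag in range(1, n)]

def shows_superdiagonal_form(corr):
    """True when the mean correlation falls at every step of added separation."""
    lags = mean_correlation_by_lag(corr)
    return all(later < earlier for earlier, later in zip(lags, lags[1:]))

def idealized_acquisition_matrix(n_trials, rho):
    """Hypothetical matrix whose entries decay as rho ** |i - j|."""
    idx = np.arange(n_trials)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def idealized_terminal_matrix(n_trials, r):
    """Hypothetical matrix with the same correlation r at every separation."""
    matrix = np.full((n_trials, n_trials), float(r))
    np.fill_diagonal(matrix, 1.0)
    return matrix

if __name__ == "__main__":
    print(shows_superdiagonal_form(idealized_acquisition_matrix(6, 0.8)))
    print(shows_superdiagonal_form(idealized_terminal_matrix(6, 0.9)))
```

In practice the matrix would come from observed sessions, and sampling error means the averages will rarely be exactly flat; the strict comparison here is kept only to make the two idealized cases easy to tell apart.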

This concept is important for statistical as well as
theoretical reasons. Repeated measures analysis of
variance requires symmetry of the variance-covariance
matrix, and if learning is not accomplished during
pretesting then systematic changes as described above
can make interpretation of data using an ANOVA model
(Winer, 1971; Morrison, 1967) difficult or impossible.
Therefore the rate-terminal process theory provides
theoretical underpinning for a statistical requirement.
Moreover, it provides a way of looking at the results
in order to determine whether stability of performance
is attained.

The PETER program was begun to standardize a
performance test battery in order to study the effects
of adverse environments on humans (Kennedy & Bittner,
1977). It is desirable that the tests in the battery
assess complex mental abilities which could be related
as elements of Navy jobs. A natural consequence of
research in this area of environmental stress is that
generally each subject serves as his own control over
many sessions. In other words, repeated measures
analysis of variance is required. Moreover, within
the context of Jones' theory, performance on all
tasks within the battery should be at terminal levels
before an experimental treatment is introduced, in
order that the changes which occur may be correctly
and differentially attributed to the faculty or ability
being tested. To our knowledge, no battery of performance
tasks exists which would permit this inference
to be made. Many batteries of primary mental abilities
have been developed and most have been factor analyzed
(cf. Carter, Kennedy, & Bittner, 1980a, for a review).
None of these has been examined in terms of stability
of subtests over sessions, and generally the factor
analyses which were performed were conducted on at
most two replications. Recently, reviews of mean
performance changes on WAIS (Thompson, 1975) and SAT
(Nader Releases ETS Report, 1980) repeated testings
have suggested that these tasks also may be less
stable than previously considered. In WAIS, SAT, and
factor analyzed batteries' reports, cross-session
correlations of subtests are ordinarily not reported
for more than 2 or 3 sessions. These issues bear
directly on the standardization of a performance test
battery for studying environmental stress, because
stable mental abilities as well as stable performance
skills will need to be measured in such a battery.
Stability can only be determined empirically by testing
over sessions. Thus, the question arises as to whether
the rate-terminal process theory would provide a
useful framework in which to evaluate the suitability
of tests of simple and complex mental work. Specifically,
do people exhibit differential rate processes
when faculties such as short term memory (Sternberg,
1966), grammatical reasoning (Baddeley, 1968), or
visualization (Ekstrom, French, Harman, & Dermen,
1976) are tested, in the same way that they acquire
the skill of turning a crank or pushing a lever (Jones,
1969)?

METHOD 

The PETER paradigm, which has been described in
detail elsewhere (Harbeson, Kennedy, & Bittner, 1979;
Kennedy, Bittner, & Harbeson, 1980; Kennedy, Carter, &
Bittner, 1980), entails testing approximately 20 persons,
usually 15 minutes a day each, for 15 days on a series
of tests of skills and abilities. Group means,
between-subject standard deviations, and cross-session
correlations are examined to determine whether they
meet set criteria. The tests under study for potential
inclusion in PETER are selected on the basis of meeting
one or more of the following criteria: (a) the test
appears in a factor analyzed battery, (b) the test
measures an information processing construct supported
by a body of research, (c) performance on the test has
been experimentally disrupted in an adverse environmental
condition of interest to the Navy (viz., motion,
thermal, pressure), (d) the task taps a factor related
to Navy jobs, or (e) the test is intrinsically motivating
(cf. Carter et al., 1980a, for additional information).
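
The three summaries named above (group means, between-subject standard deviations, and cross-session correlations) can be sketched as follows. This is only an illustrative sketch: the function names and the small score matrix are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the report does not say which form was used.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    # Between-subject standard deviation with the n - 1 divisor (assumed).
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def pearson(xs, ys):
    # Product-moment correlation between two days' scores.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def session_summary(scores):
    """scores[s][d] is the score of subject s on day d."""
    days = list(zip(*scores))  # one tuple of subject scores per day
    means = [mean(d) for d in days]
    sds = [sample_sd(d) for d in days]
    corr = [[pearson(a, b) for b in days] for a in days]
    return means, sds, corr

# Hypothetical data: 4 subjects (rows) by 3 days (columns).
scores = [
    [10, 14, 15],
    [12, 15, 17],
    [8, 13, 13],
    [15, 18, 19],
]
means, sds, corr = session_summary(scores)
print(means)
print([round(r, 2) for r in corr[0]])
```

Each row of `corr` corresponds to one row of a cross-session correlation table such as Tables 1 and 2, with the diagonal equal to 1.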

RESULTS 

Thus far sixty tests have been examined for stability.
A preliminary report covering fifteen has been
presented elsewhere (Kennedy, Carter, & Bittner,
1980). What follows are examples of two tasks which
make qualitatively different demands of subjects.
One, Grammatical Reasoning (Carter, Kennedy, & Bittner,
1980b), is a cognitive test, and the other, a video
game, Air Combat Maneuvering (Jones, Kennedy, & Bittner,
1980, in press), is largely a psychomotor task. Figure
1 shows mean and standard deviation performances for
the Air Combat Maneuvering task. It may be seen that,
typical of learning curves, the means increase dramatically
over the first few (five) sessions, and that the
rate of improvement becomes constant thereafter.

Table 1 contains the cross-session correlations for
this test. It is considered representative of the
motor skill tasks examined thus far and follows the
generic descriptions of Jones (1980). Early in practice,
the correlations degrade along each row, but
later in practice (viz., in this case, after Day 6)
the correlations appear symmetrical. That is, comparisons
6 days apart (i.e., between Days 14 and 8) are
the same as those close together (viz., between Days
13 and 14). Note also that the superdiagonal form
(Jones, 1969) is absent after Day 6. Figure 2 shows
data we consider representative of the cognitive tests
that we have studied. The group means and between-subject
standard deviations for Grammatical Reasoning
also show a learning curve, and Table 2 shows similar
form but lower correlations (e.g., task definitions)
than Table 1**. Day 15, the last day, contains anomalous
results, a common finding in our 15 day paradigm.
Discounting Day 15, symmetrical correlations appear by
Day 6 in Table 2 and are comparable to those shown in
Table 1. The reasons for these systematic changes in
correlation matrices are now described. Figure 3
shows a scatter plot of individual scores for the 23
subjects tested over the 15 days on Grammatical Reasoning.
The overall impression is of a learning curve.

Four  different  time-course  performances  were  exhibited 
by  these  subjects,  and  they  are  separated  into  classes 
in  Figures  4-7.  Figure  4  illustrates  subjects  whose 
scores  over  the  15  sessions  were  essentially  constant. 
Figure 5 shows subjects who improve with practice but
all at the same rate. Figure 6 demonstrates subjects
whose  terminal  level  is  correlated  with  initial  level 
but  the  individuals  appear  to  improve  at  different 
rates.  Figure  7  reflects  the  full  complexity  of  the 
two  process  theory  whereby  individual  differences 
exist  for  initial  and  terminal  levels  as  well  as  for 
the  rates  of  learning. 
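
The link between these individual time courses and the correlation tables can be illustrated with a small simulation. The sketch below is not the authors' model: the exponential approach to a terminal level, the parameter values, and all names are assumptions chosen only to show how parallel curves (as in Figure 5) preserve rank order, while curves that differ in initial level, terminal level, and rate (as in Figure 7) do not.

```python
from math import exp, sqrt

def learning_curve(initial, terminal, rate, n_days=15):
    """Scores on days 0..n_days-1, approaching `terminal` exponentially (assumed form)."""
    return [terminal - (terminal - initial) * exp(-rate * t)
            for t in range(n_days)]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def day_correlation(curves, day_a, day_b):
    """Correlate subjects' scores on two 0-based days."""
    return pearson([c[day_a] for c in curves], [c[day_b] for c in curves])

# Figure 5 pattern: different starting levels, same gain, same rate.
parallel = [learning_curve(i, i + 10, 0.3) for i in (5, 8, 10, 12, 15)]

# Figure 7 pattern: initial level, terminal level, and rate all differ.
# The (initial, terminal, rate) triples are hypothetical.
mixed = [learning_curve(i, t, r) for i, t, r in
         [(10, 30, 0.1), (12, 20, 0.5), (8, 35, 0.3), (15, 25, 0.2), (5, 28, 0.6)]]

r_parallel = day_correlation(parallel, 0, 14)   # first day vs. last day
r_mixed_far = day_correlation(mixed, 0, 14)     # first day vs. last day
r_mixed_near = day_correlation(mixed, 13, 14)   # adjacent late days

print(round(r_parallel, 3), round(r_mixed_far, 3), round(r_mixed_near, 3))
```

In the parallel cohort the first and last days correlate perfectly, so a single early administration would describe the subjects adequately. In the mixed cohort the late days agree closely with each other but not with the first day, which is the pattern the correlation tables show early in practice.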

DISCUSSION 

Figure  7  is  typical  of  the  general  findings  of 
many  of  our  experiments.  That  these  outcomes  would 
emerge  for  motor  skill  acquisition  tasks  was  not  sur¬ 
prising.  However,  that  tests  of  information  processing 
and  tests  of  cognitive  abilities  would  follow  super¬ 
diagonal  form  has  not  been  commented  upon  previously 
to  our  knowledge.  These  findings  have  profound  impli¬ 
cations  not  only  for  experiments  into  adverse  environ¬ 
ments  but  also  for  all  other  studies  which  follow  a 
repeated  measures  design  and  where  systematic  change 
in cross-session correlations may occur. While time
course changes similar to those of Figures 4, 5, and 6
are available in our work, they are the exception. On
the other hand, the following exhibit data like Figure
7: Digit Span (McCafferty, Bittner, & Carter, 1980),
Code Substitution (Pepper, Kennedy, Bittner, & Wiker,
1980), Copying (Moran, Kimble, & Mefferd, 1964), Letter
Rotation and other mental ability tests; Letter Search


*An exception may be the Alluisi and Chiles (1967) battery which may have been subjected to a stability analysis
during its early development. However, the inter-session and intertask reliabilities have not been reported to
our knowledge.

**When these two tests are normalized (using a 5 minute base)
for their disparate test lengths, ACM correlations
are slightly poorer than Grammatical Reasoning.


(Rose, 1974), Item Recognition (Carter, Kennedy,
Bittner, & Krause, 1980), and other information processing
tasks; Free Recall, Running Recognition, and
other memory tests (Harbeson, Krause, & Kennedy,
1980). Not surprisingly, the following psychomotor
tests also show superdiagonal form over various periods
of a 15 day experimental paradigm: Critical Tracking
(Damos, Kennedy, & Bittner, 1979), Trail Making (Kennedy,
Bittner, & Kinbender, 1980), as well as several in a
family of video games (Jones et al., 1980a, in press).
Arithmetic (Seales, Kennedy, & Bittner, 1980) shows
data like Figure 5 and is an example of a test which
stabilizes early. Only two studies from our data
provide examples which resemble Figure 4, and these,
Time Estimation (McCauley, Kennedy, & Bittner, in
press) and Complex Counting (Kennedy & Bittner, 1980),
show late if any stabilization of the correlations,
possibly because knowledge of results is not provided
in those tests. Compensatory tracking (Damos, Kennedy,
& Bittner, 1980, in press) exhibits high correlations
between initial and terminal (r > .80) performance
(cf. Figure 6), but as many persons reach their best
performance early (< 30 trials) as late (> 70 trials)
in practice.

We feel that the implications of our findings are
best viewed by reference to the illustrations shown
in Figures 4-7. For example, when factor analytic
studies of primary mental abilities are conducted by
others using large samples on several paper and pencil
tests but over only 1 or 2 administrations, it is
implicit that time course changes in individual performance
follow either Figure 4 or at least Figure 5.

Yet,  our  data  strongly  suggest  that  tests  which  now 
appear  in  factor  analyzed  batteries  often  do  not 
stabilize  until  after  several  administrations.  This 
means  that  in  previous  factor  analyses,  the  "primary 
mental  ability"  which  emerged  may  have  been  compli¬ 
cated  by  individual  differences  in  learning  the  "primary 
mental  ability"  test. 

Implications  similarly  exist  for  Selection  and 
Training  Research  where  scores  are  used  to  predict 
subsequent  performance.  In  these  cases,  it  is  essential 
that  the  initial  scores  be  stable  attributes  of  an 
individual  because  the  test-retest  reliability  of  a 
selection  test  score  (e.g.  spatial  apperception)  is 
the  expected  upper  limit  of  the  correlation  of  that 
score  with  an  external  criterion  (e.g.  success  as  an 
aviator). Thus, Selection and Training Research hopes
for outcomes like Figure 4 but admits of outcomes like
Figure 5 or 6. However, to the extent that Figure 7
can be expected to occur on selection tests and during
training regimens, inefficiency in prediction will
result if predictor scores are unstable when related
to a criterion. Obtaining stable test scores will
assuredly improve predictive validity. We feel that
implications  also  exist  for  Experimental  Psychology, 
particularly  for  information  processing  and  perception 
studies. When that discipline employs repeated measures
designs, it often attempts to control the exposure
history of subjects (or counterbalance) to account for
sequence effects. Our data show that far more practice
than is usually provided is necessary for stability.
When inferences about particular hypothetical constructs
(e.g., amphetamine modifies short term memory) are to
be made, it is necessary to have the subjects' stable
performances  of  short  term  memory  (i.e.  terminal 
process)  separated  from  their  acquisition  (i.e.  rate 
process)  prior  to  the  application  of  the  experimental 
treatment.  Only  in  this  way  may  the  experimental 
outcome  be  properly  referred  to  the  effect  of  the 
experimental  treatment  on  the  hypothetical  construct. 
Otherwise,  individual  differences  in  acquisition  are 
contaminated  with  the  individual  differences  in  the 
ability. It is our opinion that since Figure 7 appears
to better reflect reality, research in the aforementioned
fields of psychology should reexamine findings
with this in mind. Moreover, it is our view that
studies which report mean differences for asymptotic
performances between ages, sexes and races, should
also determine whether individual learning curves are
similarly  shaped  or  consider  the  possibility  that 
subjects  are  merely  following  different  paths  to 
stability.  It  is  possible  that  practice  (previous 
experience)  would  account  for  proportionately  more 
differences  in  performance  than  the  different  basic 
abilities  of  the  groups.  If  so,  there  could  be  dramatic 
practical  advantages.  For  example,  training  persons 
with  poorer  ability  may  result  in  greater  increases  in 
performance  at  less  cost  than  selecting  persons  with 
high  ability  initially. 

A  test  battery  such  as  PETER  could  serve  as  a 
useful  tool  in  assessing  the  effects  of  the  work 
environment.  Each  individual  would  practice  to  asymp¬ 
tote  on  tests  of  various  skills  and  abilities,  and 
subsequently  be  tested  in  the  work  environment.  Thus, 
it  would  be  possible  to  determine  subtle  changes  in 
performance  for  a  particular  individual,  or  for  a 
particular  function  of  that  individual.  Such  a  test 
battery  could  be  used  to  monitor  the  daily  effects  of 
a  hazardous  environment,  in  which  individuals  were 
working,  or  for  research  on  the  environment.  The 
results of such testing could be used as a warning to
remove  workers  from  dangerous  conditions,  or  to  select 
resistant  workers,  or  to  redesign  the  workplace. 

CONCLUSION 

In  conclusion,  in  research  on  human  performance, 
it  is  important  to  consider  practice  effects.  If 
subjects  are  tested  during  the  acquisition  phase  of 
training,  it  is  not  possible  to  tell  whether  differ¬ 
ences  in  performance  are  due  to  individual  differences, 
or are caused by the variable being studied. It is
only when a test is stable, that is, when mean performance
levels off and the rank order of subjects ceases
to change, that a test can be used as an accurate
measuring device.
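
One way to make the rank-order part of this criterion operational is to look for the first day after which the cross-session correlations no longer depend on how far apart the sessions are. The sketch below is an assumption-laden illustration: the tolerance of .05, the minimum of three remaining days, the function name, and the small matrix are all hypothetical, and the report itself does not state its numerical criteria.

```python
def first_stable_day(corr, tol=0.05, min_days=3):
    """Return the earliest 1-based day k such that every correlation among
    days k..last lies within `tol` of every other; None if there is none."""
    n = len(corr)
    for k in range(n - min_days + 1):
        # All off-diagonal correlations among days k..n-1 (upper triangle).
        block = [corr[i][j] for i in range(k, n) for j in range(i + 1, n)]
        if max(block) - min(block) <= tol:
            return k + 1
    return None

# Hypothetical 5-day matrix: correlations fall off with separation early
# (superdiagonal form) and are roughly equal among the last three days.
corr = [
    [1.00, 0.80, 0.70, 0.60, 0.55],
    [0.80, 1.00, 0.85, 0.78, 0.72],
    [0.70, 0.85, 1.00, 0.92, 0.90],
    [0.60, 0.78, 0.92, 1.00, 0.93],
    [0.55, 0.72, 0.90, 0.93, 1.00],
]
print(first_stable_day(corr))
```

A check on the group means (for example, that day-to-day change falls below some chosen amount) would be needed alongside this to cover the "mean performance levels off" half of the criterion.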
TABLES

Table 1

Cross-session Correlations
for Air Combat Maneuvering Test Over 15 Days (n = 22)

Day   r with later days, in row order (... = illegible entry)

 1    .78  .68  .39  .73  .72  .64  .70  .67  .61  .34  ...  .36  .63  .69
 2    .79  .81  .77  .68  .74  .70  ...  ...  .64  .72
 3    .86  .86  .84  .76  .91  .81  .79
 4    .87  .86  .83  .84  .94  .83  ...  .89  ...  .84
 5    .93  .90  .91  .88  .97  ...  .94  .90  .91  .91
 6    .91  .94  .93  ...  .91  .97  .90  ...  .93
 7    .97  .90  .93  ...  .43  .93  ...  .43
 8    .90  .93  ...  .90  .94
 9    .94  .91  .84  .93
10    .91  ...  ...  .43
11    .43  .41  .93
12    .93  .91  .94
13    .96
14    .96

Table 2

Cross-session Correlations
for Grammatical Reasoning Test Over 15 Days (n = 23)

Day   r with later days, in row order (... = illegible entry)

 1    .78  .67  .66  .63  .43  .52  .57  .46  .70  .32
 3    .88  .88  .65  .78  .79  .67  .63  .73  .64  .76  .37
 4    .78  .86  .89  .11  .76  .65  .66  .38  .11  .43
 5    .86  .82  .82  .86  .76  .82  .86  .83  .91  .68
 6    .83  .85  .83  .79  .76  .76  .80  .86  .61
 7    .82  .79  .83  .77  .77  .95  .67
 8    .86  .77  .77  .84  .80  .89  .68
 9    .77  .88  .87  .81  .87  .69
10    .64  .80  .95  .79  .77
11    .82  .81  .84  .65
12    .89  .86  .67
13    .85  .73
14    ...
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FIGURES 
Figure 1. Means and standard deviations for Air
Combat Maneuvering Test over 15 days (N=22).


Figure 2. Means and standard deviations for
Grammatical Reasoning Test over 15 days (N=23).


Figure 3. Scatter plot of total correct scores
for 23 subjects over 15 days on the Grammatical
Reasoning Test.


Figure 4. Individual learning curves of 4
subjects (4, B, I, N) whose scores appear constant
over 15 days on the Grammatical Reasoning Test.


Figure 5. Individual learning curves of 4
subjects (3, A, C, D) who improve with practice at
the same rate over 15 days on the Grammatical
Reasoning Test.
INTRODUCTION


This report deals with the progress in the
development of a Performance Evaluation Test for
Environmental Research (PETER), a program motivated
by the need for a test battery which is suitable
for administration through extensive repetitions
(Kennedy & Bittner, 1977). Nearly all studies into
unusual environments employ subjects-as-their-own-control
to the extent that "Environmental Time-Course"
(ETC) effects may be considered paradigmatic
of a class of studies which incorporates
"repeatability" as a characteristic ingredient.
Stated differently, with these paradigms the concern
is chiefly with the effects of an environment
on performance. The effect of exposure duration
itself is nearly always included as an unwanted
consequence of the experiment. Although much research
has been conducted using an ETC paradigm,
most of it has been accomplished with batteries
insufficiently standardized to yield unambiguous
results. Additionally, related literature concerning
time course changes in skill acquisition (cf.
Jones, 1962, 1969, for example) could profitably be
incorporated into those studies which follow an
ETC paradigm.

Standardization of PETER is being accomplished
to provide intercorrelation reliabilities obtained
over 15 days of testing, in addition to means and
standard deviations (Kennedy & Bittner, 1978). The
tests which have been selected for study early in
our program sample cognitive, perceptual and information
processing functions. Psychomotor, sensory
and physical proficiency tasks will be studied
later. The purposes of the present paper are: (1)
to describe our experiences with the first ten
tasks we have studied; and (2) to make inferences
about the implications of these results for environmental
research in general.

METHOD

A cadre of 19 Navy enlisted men, ages 19 to
24, were tested for 15 consecutive weekdays. Tests
on one, or at most two, of the ten tasks were administered
each day, with testing performed in the
morning between 8 a.m. and 10 a.m. Subjects were
monitored for fitness by a team of physicians. All
volunteer subjects were recruited and evaluated in
accordance with procedures specified in Secretary of
the Navy Instruction 3900.39 and Bureau of Medicine
and Surgery Instruction 3900.6, which require voluntary
informed consent and meet or exceed the most
stringent provisions of all prevailing national and
international guidelines.

RESULTS


Complex Counting Test (Kennedy & Bruns, 1975)

Results for this test are shown in Figures
1 and 2. Both mean scores and standard deviations
(Figure 1) were relatively level (within 10
percent) over the three weeks of testing. Correlations
are shown in Figure 2, where performances
on selected base days (Days 1, 2, 4, 9, and 13)
are compared with each subsequent day, not only
in order to determine intertrial reliability of
a particular day's performance, but also to
monitor series effects in these reliabilities.
Examining Figure 2, it may be seen that the
reliability of Day 4 with subsequent days is
very good (r > .85). In Table 1, the ANOVA
shows a significant subjects effect (p < 10^-5)
but a nonsignificant (p > .10) days effect.
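
The ANOVA tables in this paper have the form of a subjects-by-days layout with one score per cell (sources DAYS, SUBJS, and RESID). A minimal sketch of that computation follows, assuming the residual is the subjects-by-days interaction; the function name and the tiny data set are hypothetical, and the p values are omitted because they require an F distribution that the Python standard library does not provide.

```python
def subjects_by_days_anova(scores):
    """scores[s][d] is the score of subject s on day d (one score per cell)."""
    n_subj, n_days = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_subj * n_days)
    subj_means = [sum(row) / n_days for row in scores]
    day_means = [sum(col) / n_subj for col in zip(*scores)]

    ss_subj = n_days * sum((m - grand) ** 2 for m in subj_means)
    ss_days = n_subj * sum((m - grand) ** 2 for m in day_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_subj - ss_days

    df_days, df_subj = n_days - 1, n_subj - 1
    df_resid = df_days * df_subj
    ms_resid = ss_resid / df_resid
    return {
        "DAYS": {"df": df_days, "MS": ss_days / df_days,
                 "F": (ss_days / df_days) / ms_resid},
        "SUBJS": {"df": df_subj, "MS": ss_subj / df_subj,
                  "F": (ss_subj / df_subj) / ms_resid},
        "RESID": {"df": df_resid, "MS": ms_resid},
    }

# Hypothetical data: 3 subjects (rows) by 3 days (columns).
table = subjects_by_days_anova([[1, 2, 3], [2, 4, 3], [3, 3, 6]])
for source, row in table.items():
    print(source, row)
```

With 19 subjects and 15 days the same formulas give 14, 18, and 252 degrees of freedom, and with 18 subjects they give 14, 17, and 238, which are the values printed in the tables of this paper.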


Table 1


ANOVA: Complex Counting


SOURCE    DF        MS        F     P
DAYS      14     30.23     0.58     NS
SUBJS     18   2510.51    48.51     <10^-5
RESID    252     51.75

Grammatical Reasoning Test (Baddeley, 1968;
Rose, 1974)


The results are shown in Figures 3 and 4.
Examining Figure 3, it may be seen that both
means and standard deviations of performance
increase over trials at a declining rate. The
ANOVA in Table 2 supports this learning curve
with a significant days effect (p < 10^-5), and
also shows a significant subjects effect (p <
10^-5). Reliabilities are shown in Figure 4 and
are moderate when comparing Days 1 and 2 with
other days, but very good (r > .80) with comparisons
made after Day 4. However, all reliabilities
decline over sessions (cf. Jones, 1969) and the
rates of decline are nearly equivalent.


Table 2


ANOVA: Grammatical Reasoning


SOURCE    DF        MS        F     P
DAYS      14     99.20    14.70     <10^-5
SUBJS     17    277.88    41.18     <10^-5
RESID    238      6.75

Research  performed  under  Navy  Work  Unit  No.  MF58. 524-002-5027.  The  opinions  are  those  of  the  authors 
and  do  not  necessarily  reflect  those  of  the  Department  of  the  Navy. 


Code Substitution Test (after Wechsler, 1955)


The results appear in Figures 5 and 6. Mean
performance (Figure 5) (total correct) improves
over the 15 testing administrations but appears
to decelerate after Day 9. Standard deviations
(Figure 5) appear equal after Day 7. Average
correlations (Figure 6) for subsequent days are
poorest for Days 1 and 2. The reliability of Day
4 with later days is about .60. Table 3 contains
an ANOVA for total correct and shows significant
days and subjects effects.


Table 3


ANOVA: Code Substitution


SOURCE    DF        MS        F     P
DAYS      14    504.09     7.71     <10^-5
SUBJS     18   1524.76    23.32     <10^-5
RESID    252     65.37


Stroop Test (Jensen & Rohwer, 1966)


The data appear in Figures 7, 8, 9, and 10.
Mean scores for three directly measured performances
and two difference scores (derived) are
shown in Figure 7. Performance improves for 10
days on all measures but appears relatively
asymptotic thereafter. Standard deviation scores
are found in Figure 8. The correlations were
highest for colored blocks (CB) and poorest for
the derived score CB-CW (Figure 10). Correlations
for colored words (CW), the most commonly used
score, are shown in Figure 9. The present test
administration differed from that used by most
other investigators in that response keys (vice
verbal responses) were used and test administrations
were brief (30 seconds); these differences may have
been a factor in obtaining lower reliabilities than
reported elsewhere (Jensen & Rohwer, 1966). The
ANOVAs for all Stroop scores showed significant
subjects and days effects, and two (CW & CB-CW)
are shown in Tables 4 and 5 respectively.


Table 4


ANOVA: Stroop Test Color Words


SOURCE    DF        MS        F     P
DAYS      14    657.64    29.11     <10^-5
SUBJS     18   1356.63    59.15     <10^-5
RESID    252     22.93


Table 5


ANOVA: Stroop Test CB-CW Score


SOURCE    DF        MS        F     P
DAYS      14    173.77     5.43
SUBJS     18    198.71     6.20     <10^-5
RESID    252     32.03


Arithmetic Test


This was a paper and pencil test which
alternated arithmetic operations: three digit
addition; three digit subtraction; two digit by
two digit multiplication; and four digit by two
digit division. Figure 11 shows mean performances
which appeared to be reaching an asymptote after
10 days of testing. The standard deviations
also shown in Figure 11 appear to increase
throughout the experiment, suggesting that dispersion
increases over sessions. The reliabilities
(Figure 12) are generally high (r > .90) and do
not appear to decline over sessions. Table 6
shows days and subjects effects. It is of
interest that number attempted, number
correct, and number right minus wrong all reflected
average reliabilities substantially higher
(r > .90) than percent correct of number attempted
(r < .70) for the same data.


Table 6


ANOVA: Arithmetic Test


SOURCE    DF        MS        F     P
DAYS      14    233.88     8.17     <10^-5
SUBJS     17   4850.05   169.35     <10^-5
RESID    238     28.64


Neisser Letter Search (Neisser, Novick & Lazar,
1963; Rose, 1974)


Results are shown in Figures 13 and 14.
Mean slope scores and standard deviations shown
in Figure 13 appear relatively level for the
duration of the experiment although with some
variability. Means and standard deviations also
seem to co-vary. In Figure 14, correlations
were low for base days 1, 2 & 4 (r < .50) but
appeared higher after Day 9. Table 7 shows
significant subjects and days effects.


Table 7


ANOVA: Letter Search


SOURCE    DF        MS        F
DAYS      14       .15     8.93
SUBJS     17       .13     8.07
RESID    238       .02


Critical Tracking Test (Jex, McDonnell, & Phatak,
1966; Rose, 1974)

The data appear in Figures 15 and 16. Mean
scores (Figure 15) improve for the duration of
the experiment but at a declining rate. The
plateau on Days 13 through 15 is due either to
performance reaching an asymptotic level or to
the subjects' anticipation of the completion of
the experiment. The standard deviation (Figure
15) was relatively constant over days. The
average reliability (Figure 16) of Days 1 and 2
with subsequent days is far lower (r < .60) than
for Day 4 and thereafter. The decline over days
is very apparent. The ANOVA (Table 8) shows
significant days and subjects effects.

Table 8


ANOVA: Critical Tracking


SOURCE    DF        MS        F     P
DAYS      14      9.74    49.87     <10^-5
SUBJS     17      6.79    34.76     <10^-5
RESID    238       .20

Subcritical Two Dimensional Compensatory Tracking Test

This test was administered after the completion
of the critical tracking test. An acceleration
control displacement stick was used. Mean and
standard deviation scores reached a plateau
(Figure 17) by Day 5. Reliabilities (Figure 18)
were high the first 10 days, but apparatus malfunction
produced a dead spot on the CRT which was
discovered by a few subjects around Day 10.
Thereafter reliabilities degraded. Both subjects
and days effects were significant in the ANOVA
(Table 9).

Table 9

ANOVA: Compensatory Tracking Test


SOURCE    DF        MS        F     P
DAYS      14     57.60    42.69     <10^-5
SUBJS     17     10.41     7.72     <10^-5
RESID    238      1.35

Time Estimation (Graybiel et al., 1965)

Results are shown in Figures 19 and 20. The
means and standard deviations (Figure 19) were
relatively level. Table 10 summarizes the ANOVA
and indicates that the subjects effect was significant;
however, the days effect was not significant.
These results support data obtained previously
(Graybiel et al., 1965).


Table 10


ANOVA: Time Estimation Test


SOURCE    DF        MS        F     P
DAYS      14       .88      .85     NS
SUBJS     18      7.51     7.30     <10^-5
RESID    252      1.03

Reliabilities of given base days with each
subsequent day were moderate but approached zero
with additional days (Figure 20). A fine grained
analysis (McCauley, Kennedy, & Bittner, in
press) shows that parts of this test have higher
reliabilities (r > .90) than the whole test.

Spoke Test

This test is a modification of the Trail
Making Test (Reitan, 1955) and has a psychomotor
subtask, the control task (CT), and a visual
search subtask, the experimental task (ET).
Figure 21 shows level mean scores and slight
variability in standard deviations for the CT
measure. However, the days effect, in addition
to the subjects effect, was significant (Table
11). Figure 22 shows level and moderately high
reliabilities which do not appear to increase or
decrease with trials after Day 2.

Table 11


ANOVA: Spoke Test Control Task


SOURCE    DF        MS        F     P
DAYS      14     28.09     3.13     <10^-5
SUBJS     17    399.84    44.62     <10^-5
RESID    238      8.96

Figure 23 shows improving search times over the
first few days and relatively level performance
thereafter. The ET standard deviations were
somewhat variable. Table 12 shows significant
days and subjects effects for ET. Reliabilities
for ET (Figure 24) were lower than CT (r < .30)
for Base Day 4 and thereafter.

Table 12


ANOVA: Spoke Test Experimental Task


SOURCE    DF        MS        F     P
DAYS      14   1388.22     5.24     <
SUBJS     17   2938.26    11.10     <10^-5
RESID    238    264.68



DISCUSSION 


Fifteen  measures  on  ten  different  tests 
were  reported  in  this  study,  thirteen  of  which 
showed  significant  learning  (i.e.,  days)  effects. 
The two exceptions were Time Estimation and
Complex Counting, replicating findings reported
elsewhere (Kennedy & Bruns, 1974; Graybiel et
al., 1963). The greatest practice effects appeared
with both tracking tests, followed by the Stroop
Test and the Grammatical Reasoning Test, tasks on
which reaction time or speed of manual response
could contribute to the total score. An inspection
of  the  variations  in  obtained  standard  deviations 
over  sessions  exemplifies  the  importance  of 
testing control groups. Some standard deviations
remained level after a few days; some co-varied
with other measures of performance; and others
showed no systematic trends related to changes in
the means or to changes in the reliabilities of
the tests. The analyses of reliabilities over
extensive  testing  showed  that  they  were  suffi¬ 
ciently  high  for  the  inclusion  of  some  tests  in  a 
battery  in  their  present  form  (e.g.,  Complex 
Counting  and  Arithmetic)  and  suggest  that  longer 
tests  may  be  required  for  others  (Coding  and 
Spoke  ET).  In  some  cases,  reliabilities  degrade 
to  a  point  (Time  Estimation,  Stroop  derived 
scores)  that  it  is  unlikely  that  an  effect  however 
large,  could  be  shown  to  be  statistically  signifi¬ 
cant  if  the  test  were  employed  in  its  present 
form.  Previous  environmental  research  with  test 
batteries  can  be  questioned  based  upon  the  results 
shown in this report. The decline in reliabilities
of tasks with repeated testing, shown for
most tasks in this study, indicates that the
"factors" measured in an experiment may change
over time. Control groups provide protection
against changes in mean performances, but the
responses of subjects in both experimental and
control groups may reflect one "factor" at the
beginning (X) and another at the end (Y). Differences,
therefore, may be due to mean differences
in (X) initially and in (Y) at the end. High
reliability over only one test repetition affords
little protection from this problem. Time Estimation,
for example, showed high reliability (r = .95)
for the relationship of Base Days 9 and 10 (i.e.,
after eight days practice). However, the full
regression (to r = .60 with only six administrations)
showed that only by long term studies,
such as the present one, can experimental tasks
be evaluated for meaningful application in environmental
research.
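
The base-day reliabilities plotted in the figures of this paper (a selected base day correlated with each subsequent day) can be sketched as below. The function names and the small data set are hypothetical, and the simple average over later days is an assumption about how the "average reliability" in the text was formed.

```python
from math import sqrt

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def base_day_profile(scores, base_day):
    """Correlate a 1-based `base_day` with every later day.

    scores[s][d] is the score of subject s on day d (0-based column)."""
    days = list(zip(*scores))
    base = days[base_day - 1]
    return {d + 1: pearson(base, days[d]) for d in range(base_day, len(days))}

def average_reliability(profile):
    # Simple mean of the base day's correlations with later days (assumed).
    return sum(profile.values()) / len(profile)

# Hypothetical data: 4 subjects (rows) by 4 days (columns).
scores = [
    [1, 3, 5, 2],
    [2, 5, 4, 9],
    [3, 7, 8, 4],
    [4, 9, 6, 7],
]
profile = base_day_profile(scores, base_day=1)
print({day: round(r, 2) for day, r in profile.items()})
print(round(average_reliability(profile), 2))
```

A profile that stays level as the later day moves away from the base day corresponds to a stable test; one that falls toward zero corresponds to the regression described above for Time Estimation.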
Figure 1. Complex Counting Test means and
standard deviations for percent correct over 15
days (n=19).


Figure 2. Complex Counting Test correlations for
selected base days (1, 2, 4, 9, 13) and those
following for percent correct over 15 days (n=19).


Figure 3. Grammatical Reasoning Test means and
standard deviation for total correct over 15 days.


Figure 4. Grammatical Reasoning Test correlations
for selected base days (1, 2, 4, 9, 13) and those
following for total correct over 15 days (n=18).


Figure 5. Code Substitution Test means and
standard deviations for total correct over 15
days (n=18).


Figure 6. Code Substitution Test correlations
for selected base days (1, 2, 4, 8, 10, 12) and
those following for total correct over 15 days.


Figure 7. Stroop Test means for number of responses
on five measures: black and white words
(BW), color blocks (CB), color words (CW), CB-CW,
and BW-CB, over 15 days (n=19).


Figure 8. Stroop Test standard deviations for 5
measures over 15 days (n=19).


Figure 10. Stroop Test correlations for selected
base days (1, 2, 4, 9, 13) and those following
for CB-CW over 15 days (n=19).


Figure 11. Arithmetic Test means and standard
deviations for total correct over 15 days (n=18).


Figure 12. Arithmetic Test correlations for
selected base days (1, 2, 4, 8, 10, 12) and those
following for total correct over 15 days (n=18).

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Figure  9.  Stroop  Test  correlations  for  selected 
base  days  (1,  2,  4.  9,  13)  and  those  following 
for  colored  words  over  15  days  (n-19). 


Figure  13.  Letter  Search  Test  means  and  standard 
deviations  for  time  per  Item  slope  over  15  days 
(n-18). 


Figure  14.  Letter  Search  Test  correlations  for 
selected  base  days  (1.  2,  4,  8,  10.  12)  and  those 
following  per  time  per  item  slope  over  15  days 
(n-18). 


DAY 


Figure  17.  Compensatory  Tracking 
standard  deviations  for  RMS  error 
(n-18). 
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Figure  15.  Critical  Tracking  Test  means  and  stan¬ 
dard  deviations  of  scores  over  15  days  (n=18) . 
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Figure  18.  Compensatory  Tracking  Test  correla¬ 
tions  for  selected  base  days  (1,  2.  4.  9.  13)  and 
those  following  for  RMS  error  over  15  days  (n-18). 
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Figure  16.  Critical  Tracking  T  st  correl a t i 
for  selected  base  days  (l,  2,  4,  9,  13)  and  those 
following  for  scores  over  15  days  (n-18). 


Mgure  19.  Time  bstimation  Test  means  and  stan^ 
dard  deviations  for  constant  error  over  15  days 
Cn-1  9") . 
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Figure  20.  '  Time  Estimation  Test  correlations  for 
selected  base  days  (1,  2,  4,  8,  10,  12)  and  those 
following  for  constant  error  over  15  days  (n*19). 
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Figure  23.  Spoke  Test  means  and  standard  devia¬ 
tions  for  experimental  task  (time  to  completion) 
over  15  days  (n«18). 
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Figure  24.  Spoke  Test  correlations  for  selected 
base  days  (1,  2,  4,  8,  10.  12)  and  those  following 
for  experimental  task  over  15  days  (n»18). 


Figure  21.  Spoke  Test  means  and  standard  devia¬ 
tions  for  control  task  (time  to  completion)  over 
15  days  (n«18) . 


Figure  22.  Spoke  Test  correlations  for  selected 
base  days  (,  1,2  4,  8,  10,  12)  and  those  following 
for  control  task  over  15  days  (n*18). 
ABSTRACT

The  basic  problem  with  performance  testing  in  exotic 
environments  is  the  general  unwillingness  of  investiga¬ 
tors  to  take  the  time  to  standardize  a  test  battery. 

Many  other  problems  exist  and  are  obvious  to  all  who 
have  tried  to  measure  performance  under  usual  and  un¬ 
usual  environmental  conditions.  It  is  the  purpose  of 
this  paper  to  set  forth  some  of  the  problems  that  have 
grown out of our experiences and which we feel have not
been extensively commented upon in the research literature,
and also to describe our plan for solution.

Preface 

The  present  plan  is  a  simple  one:  The  literature  will  be  searched  for  human  per¬ 
formance  tasks  which  have  been  shown  to  degrade  under  motion  (vibration  and  ship 
motion),  during  thermal  exposure,  and  under  pressure.  The  performances  that 
meet  these  first  criteria  will  be  categorized  as  cognitive  (decision  making,  in¬ 
formation  processing,  judgment),  motor  (tracking,  reaching),  etc.,  and  a  taxonomy 
of  performances  will  be  developed.  Additionally,  each  performance  task  will  be 
evaluated  in  the  following  way:  20  subjects  will  be  tested  10  times  (5  days/ 
week  for  2  weeks)  to  determine  three  types  of  reliability:  internal  consistency, 
the  accuracy  and  sensitivity  to  separate  individuals,  and  the  stability  of  this 
accuracy  and  sensitivity  over  repeated  testing.  Performances  on  these  tasks  will 
be  compared  to  scores  on  other  tests  of  mental  functions.  Progress  to  date  will 
be  reported. 
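
The third type of reliability in this plan, stability over repeated testing, can be examined by correlating scores on a base day with scores on each later day. A minimal sketch follows; the data layout, the function names, and the four-subject score table are assumptions made for illustration and are not project data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def base_day_correlations(scores, base_day):
    """Correlate one base day with every later day.

    scores[s][d] is the score of subject s on day d (days are 0-indexed).
    Returns a dict mapping each later day to its correlation with the base day.
    """
    n_days = len(scores[0])
    base = [row[base_day] for row in scores]
    return {
        day: pearson(base, [row[day] for row in scores])
        for day in range(base_day + 1, n_days)
    }


# Made-up scores for 4 subjects over 3 days (illustration only).
scores = [
    [10, 12, 15],
    [20, 22, 18],
    [30, 32, 29],
    [40, 42, 31],
]
stability = base_day_correlations(scores, base_day=0)
```

With the planned 20 subjects and 10 sessions, the same call would be repeated for several base days so that any decline in correlation with increasing separation between sessions becomes visible.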

The National Aeronautics and Space Administration, the Advanced Research Projects
Agency,  the  Navy  (via  the  Office  of  Naval  Research),  and  the  Bureau  of  Medicine 
and  Surgery  have  funded  several  studies  (see  Kennedy,  1977  for  a  review)  which 
have  nearly  all  made  very  similar  points  regarding  the  standardization  of  a  per¬ 
formance  test  battery  for  assessment  of  environmental  stressors.  In  the  main, 
test  batteries  have  been  proposed,  particularly  factor  analyzed  batteries,  but 
rarely  have  normative  data  been  collected  and  never  have  practice  effects  been 
studied  effectively. 

The  original  title  for  the  present  paper  was  very  broad  and  included  all  Navy 
R  &  D  concerning  performance.  We  intend,  however,  merely  to  present  how  the 
Naval  Aerospace  Medical  Research  Laboratory  Detachment  plans  to  research  the 
general area, with specific application to our interests in the effects of ship
motion on performance. It should be noted that, in addition to the human performance
R & D already presented at this symposium by various members of the Navy
Personnel Research and Development Center, complementary programs also exist
within the Engineering Psychology Programs of the Office of Naval Research and
within the Human Effectiveness Programs of the Naval Medical Research and Development
Command.

Casual observation over several years of performance testing and a comprehensive
reading of over 400 "human performance studies" in hyperbaria (see Bachrach &
Kennedy, 1977, for a review) suggest that there is a need for future studies
into the standardization of a human performance test battery.

In  our  opinion,  the  persons  who  initiated  the  experiments  requiring  performance 
testing in exotic environments were generally persons who became involved originally
because of a primary interest in the environment rather than in the performance.
(Within "environment" we include unusual sensory stimulations, drugs,
fatigue, and even learning, as well as motion sickness, hyperbaria, etc.) Thus,
we  feel  that,  frequently,  several  criteria  were  employed  (often  trading  back  and 
forth  among  them)  in  the  selection  of  tasks  for  inclusion  in  a  battery  to  be 
assembled.  These  criteria  have  included  the  following: 

1.  Literature  findings  that  were  recollected,  probably  because  the  results 
of  tests  were  unusual. 

2.  What  colleagues  and  friends  had  done. 

3. What demonstration experiments were performed in the experimental psychology
laboratory during their student days.

4.  Chapter  headings  in  Woodworth  and  Schlosberg  (1954)  and  other  standard 
texts. 

5. Equipment left behind in the storage room of the laboratory by their
predecessors.

6. That which could be quickly and easily assembled from clever ideas (the
so-called toy gadget approach).

7.  Stock  items  from  apparatus  companies. 

8.  Logistic  limitations  forced  by  the  environment  or  project  (e.g.,  small, 
inexpensive,  no  tubes,  portable,  nonmagnetic,  self-scored,  no  sparks,  self- 
administered,  battery  powered,  and  rugged). 

9. Similarity to the work done by real-world persons.

10.  A  relatively  basic  kind  of  skill  is  involved;  that  is,  learning  theoret¬ 
ically  SHOULD  be  able  to  be  accomplished  quickly. 

11.  Less  often,  performances  could  be  expected  to  be  disrupted  on  the  task 
in  this  environment. 


We  believe  that  the  criteria  listed  above  have  been  employed  often  enough  to 
assemble batteries so that these criteria are worth citing. It should also be
noted, however, that a test battery was typically an ad hoc response
to  the  imminent  availability  of  an  environmental  condition,  whether  the  environ¬ 
ment  was  a  hurricane  (Kennedy,  Moroney,  Bale,  Gregoire,  &  Smith,  1970),  a  rotating 
room  (Guedry,  Kennedy,  Harris,  &  Graybiel,  1964;  Fregly  &  Kennedy,  1965;  Kennedy, 
Tolhurst  &  Graybiel,  1965),  or  a  deep  dive.  Thus,  long-range  planning  frequently 
is  not  possible.  In  summary,  it  is  felt  that  performance  test  batteries  are 
often  assembled  for  largely  practical  reasons,  on  short  notice,  by  persons  whose 
major  interest  is  not  performance  testing.  To  alleviate  these  problems  we  have 
combined,  in  tabular  form,  what  we  consider  the  traditional,  important  criteria 
for  test  construction  along  with  the  practical  aspects  concerning  operational 
performance  assessment.  These  criteria  are  summarized  in  Tables  1-4.  In  addi¬ 
tion,  other  problems  with  performance  test  battery  construction  exist. 

1. What performance tests are designed to measure

Although  this  distinction  is  not  generally  made,  it  is  implicit  that  perform¬ 
ance  testing  is  undertaken  for  two  main  purposes:  first,  to  be  able  to  make 
some  statement  about  the  integrity  of  the  organism,  and  second,  to  determine 
whether  an  environment  interacts  with  an  organism's  ability  to  do  a  particular 
kind  of  work  (cf.  Table  3).  In  this  paper,  the  first  purpose  will  be  called 
"CNS  status,"  and  the  second,  "effectiveness  of  a  system's  output."  Examples 
of  tests  designed  for  the  former  purpose  include  reaction  time,  digit  span, 
tremor,  electroencephalogram,  speed  of  tapping,  and  CFF.  Examples  of  the  latter 
include  an  underwater  pipe  puzzle,  a  sonar  monitoring  task,  Morse  code  tests, 
and speech intelligibility tasks. Frequently, both types of tasks are included
in a single experiment on the environment's effect on man, without regard
to the distinction made above. The advantage of the latter approach is that the
systems concept is used and the translation to real-world activities is direct. (Also,
subject  cooperation  is  usually  better.)  The  disadvantage  is  that  no  general 
principles  are  adduced  and  the  application  of  the  findings  holds  only  for  the 
stimulus  condition  employed.  For  instance,  tracking  studies  with  CRT  displays 
have  been  conducted  for  many  years  and  very  few  general  rules  have  resulted 
(Adams, 1961). The major disadvantage of the first approach (index of an organism's
integrity) is that it depends heavily upon knowledge of the validity
of the task. If only face validity is available, other considerations (money,
size, apparatus, and availability) must be used to justify inclusion. If face
validity is not evident, then justification is very tenuous.

The  distinction  made  between  these  two  strategies  is  subtle,  but  it  is  also  real, 
and  its  existence  complicates  the  results  of  many  studies.  This  is  chiefly  due 
to  the  fact  that  the  two  approaches  require  different  research  philosophies, 
although the ultimate aim of both approaches is similar: namely, prediction
(i.e.,  an  ability  to  account  for  100  percent  of  the  variance). 

The  first  approach  comes  directly  from  experimental  psychology  and  usually  fol¬ 
lows  an  analysis  of  variance  model.  Thus,  the  numerous  tests  in  a  test  battery 
are  designed  to  sample  all  of  the  skills  (factors)  of  the  organism.  The  impli¬ 
cation  is  that,  if  the  full  range  of  human  abilities  is  tested,  one  can  general¬ 
ize  the  findings  and  apply  them  to  other  circumstances  (e.g.,  subjects,  treat¬ 
ments,  etc.).  This  approach  depends  heavily  upon  following  the  principles  of  test 
construction: (1) norms, (2) reliabilities, (3) validities, (4) factors tested,
(5) effects of practice, and (6) individual differences. If all these principles
were satisfactorily fulfilled, it would be possible to employ the test in
an exotic environment and account for all the main effects of such an environment
on human performance. For example, if it were known that hand dynamometry
correlated  perfectly  with  all  other  kinds  of  voluntary  skeletal  muscle  output, 
and  the  Harvard  Step  Test  (Kennedy  &  Hutchins,  1971)  with  all  cardiac  muscle 
output,  then  it  would  not  be  necessary  to  use  other  tests  of  these  functions. 

The  difficulty,  of  course,  is  that  neither  of  these  tests  correlates  sufficiently. 
Additionally,  other  "more  psychomotor"  tasks  are  even  less  clear-cut  with  regard 
to  what  they  are  measuring  (i.e.,  validities).  However,  the  problem  does  not  end 
here. Reliabilities of a test battery, any test battery, are not completely
known. No norms (expected values) are available on a sizable population, particularly
when practice effects are concerned. However, factor analysis studies
(e.g., those of Fleishman) have been completed for some samples.¹

The  second  approach  is  in  vogue  more  now  than  previously,  probably  because  it 
emphasizes  a  systems  approach.  The  statistical  model  employed  is  correlation, 
and  in  general,  single  factor  studies  are  conducted.  The  overall  plan  is  to 
replicate  real-world  work  and  to  do  it  under  controlled  conditions.  The  second 
approach does not depend upon the validity of the task as heavily as the first
method,  since  it,  itself,  is  the  work.  However,  the  characteristics  of  the  sub¬ 
jects  are  critical.  It  is  important,  and  usually  essential,  that  the  subjects 
be  the  same  kind  of  people  as  the  real-world  workers  toward  whom  the  data  will 
be  applied.  The  shortcoming  of  this  strategy  is  also  its  chief  advantage:  the 
application  of  the  findings  from  such  studies  is  specific  and  immediate,  but 
sometimes  it  is  so  specific  that  generalization  within  the  same  environment, 
but  with  slight  differences,  may  not  be  possible. 

2. Two experimental paradigms

There  are  two  main  ways  in  which  to  study  the  effects  of  the  environment  on  a 
subject's  ability  to  do  work.  The  first  (most  often  used)  uses  the  subject  as 
his  own  control  and  generally  follows  a  pre-,  per-  and  post-  paradigm.  In  the 
pretest,  the  subject  is  practiced  on  all  the  tests  to  be  employed  in  order  to 
arrive  at  a  learning  plateau.  Then  he  is  placed  in  the  experimental  situation 
to  see  whether  or  not  it  disrupts  performance.  Posttesting  is  used  to  monitor 
recovery  effects,  if  there  are  any.  There  are  many  problems  with  this  approach. 
Chiefly,  psychomotor  performance  almost  never  arrives  at  a  plateau.  This  is 
discussed  in  more  detail  later  in  this  paper.  Asymptotes  occasionally  are  ob¬ 
tained,  but  these,  too,  are  infrequent.  Even  on  tests  where  one  would  expect 
practice  to  be  accomplished  quickly  (e.g.,  reaction  time,  CFF,  tracking  visual 
acuity),² the environment itself occasionally causes certain tests to be performed
less well while standing during rotation, and is probably also measuring


¹Sinbad (1969) is based on these studies and, when standardized, may be
used to obviate some of the problems mentioned above.

²The use of signal detection theory (Swets, Tanner, & Birdsall, 1961) as
a methodology may be helpful here, but as we all know from the way the 100-yard
dash record is continually broken, it is not just a criterion problem. Stated
differently, a knowledge of sensory sensitivity, d' (d-prime), separated from
the subject's criterion (beta) would refine present knowledge, but d', even
carefully and prudently measured, may change with practice.
body sway (Graybiel, Kennedy, Knoblock, Guedry, Mertz, McLeod, Colehour, Miller,
& Fregly, 1965). This point will also be discussed later. Post-effects also
present difficulties since motivation changes (e.g., end spurt in vigilance)
usually attend the imminent completion of an experiment.
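
The separation of sensitivity from criterion mentioned in the footnote on signal detection theory can be sketched as follows. This is a minimal illustration under the equal-variance Gaussian model, which is an assumption not stated in this paper; the function name and the example rates are likewise hypothetical.

```python
from math import exp
from statistics import NormalDist


def d_prime_and_beta(hit_rate, false_alarm_rate):
    """Return (d', beta) under the equal-variance Gaussian model.

    d' is the separation of the signal and noise distributions in z units;
    beta is the likelihood ratio (signal over noise) at the subject's criterion.
    """
    if not (0.0 < hit_rate < 1.0 and 0.0 < false_alarm_rate < 1.0):
        raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2.0)
    return d_prime, beta


# Illustration only: a subject with 80% hits and 20% false alarms.
d_prime, beta = d_prime_and_beta(0.80, 0.20)
```

Computing both values on each session would show whether a change with practice lies in the criterion, in sensitivity, or in both, which is the point the footnote raises.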

The alternative approach, to test "just before" and "just after" the environmental
exposure (say, a 12-hour overwater ASW flight), has its own problems;
namely, the experimenter feels that it is necessary to be aware of the status
of the subject during the exposure. If the testing is short (e.g., hand dynamometry),
it can be influenced by the bias of a subject and the summoning of effort
for a "one-shot deal" so that, often, changes are not obtained even though the
subject is frankly tired. If the testing period is long (e.g., treadmill), it
can  contribute  to  the  fatigue.  In  addition,  lengthy  posttests  are  often  unfair 
to  the  subject. 

3. Assessment of input-integrator-output circuits

The general form of psychological experimentation follows an S-R paradigm, or
SOR, where O is for organism (Graham, 1951). Performance testing employs this
paradigm  particularly  when  "CNS  status"  type  experiments  are  conducted.  Typi¬ 
cally,  in  these  studies  the  experimenter  is  mainly  interested  in  whether  his 
treatment  (drugs,  hypoxia,  confinement,  magnetic  fields)  produces  any  CNS  change. 
So,  a  stimulus  is  presented  and  the  output  of  the  organism  is  monitored  for 
changes. Frequently, however, due account is not taken of whether the stimulus
was adequately received by the receptor (retina, ear, hair cells, etc.) and then
properly delivered along that nerve pathway, or of whether the output (muscle)
pathway is similarly unaffected. For example, during acceleration stress, the
lack  of  oxygen  to  the  retina  indicates  that  signals  are  not  adequately  received 
at  the  receptor  site.  This  also  occurs  with  the  differences  obtained  in  visual 
performance  underwater.  The  physical  conduction  of  light  in  air  versus  water 
may account for these differences; most likely the visual signal is just not
delivered  to  the  receptor  in  water  as  well  as  in  air,  so  one  would  not  posit 
CNS  changes  underwater  to  account  for  the  poorer  visual  acuity  obtained.  At  the 
other  end  of  the  nerve-muscle  circuit,  changes  in  four-choice  reaction  time  done 
underwater clearly have, on the one hand, the friction of water to slow down performance,
as well as the possible other effects of compression and mixed gases,
and so, probably, CNS changes cannot adequately be assessed with this task. So,
too, past pointing underwater may be different: not because of central involvement,
but because of inertial differences on the arm. This is not to imply that
such studies should not be undertaken; rather, it behooves the experimenter to
indicate where possible which part of the SOR circuit he is testing. Therefore,
one must know about the transmission characteristics of light, the dependency
of the retina on oxygen, and the viscosity and buoyancy characteristics of water.
However, if such tasks are included in batteries that have other tests (the
intention of which is to tap the state of the CNS), and all results are reported
together, there is confusion.

It  would  be  useful  to  other  investigators  if  results  of  experiments  were  reported 
relative  to  that  part  of  the  circuit  which  is  being  tested.  This  cannot  be  done 
in  all  cases,  but  it  is  possible  to  improve  present  reporting  practices.  Per¬ 
haps  if  we  intellectually  remove  the  known  physical  environmental  effects  from  the 
periphery (nerve and muscle), we may be left with the finding that motivation
and the partial pressure of oxygen in the brain are the chief contributors to
performance decrement under all conditions. The above criticism does not apply
to the "systems output" type of studies, which take no position regarding where
in the circuit the problem occurs. Rather, their sole purpose is to determine
whether the environmental condition interacts with people doing work.

It is proposed that "CNS status" be used as a term to be contrasted with "input/
output quality" types of studies, whereby the former would deal with throughput
changes due to the environment and the latter would address the physical aspects
of the environment on man.

4.  Practice  effects 

In  a  significant  but  not  widely  referenced  paper,  Bradley  (1962)  reported  the 
persistence  of  sequence  effects  during  psychomotor  testing.  Virtually  all  who 
study  performance  over  many  sessions  have  obtained  similar  findings.  As  was 
mentioned  earlier,  the  investigator  usually  performs  baseline  pretesting  before 
placing  the  subjects  in  the  environment.  Often,  many  trials  are  given  (in  one 
study,  7  days  of  testing)  in  an  effort  to  have  performance  asymptotic  "so  that 
the pimple on the line can be more easily seen."³ What is usually obtained is
the  well-known  learning  curve,  which  may,  but  does  not  always,  asymptote.  The 
problem  with  this  approach  is  obvious,  but  there  is  another  less  obvious  problem; 
that  is,  performance  on  a  task  after  many  trials  is  probably  no  longer  an  index 
of  the  same  activity  or  place  in  the  CNS  that  it  was  initially. 
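
Whether group performance has in fact levelled off before the exposure can at least be checked numerically. The sketch below fits a least-squares line to the last few baseline sessions and treats a near-zero slope as a plateau. The window of five sessions, the tolerance, the function names, and the daily means are all assumptions chosen for illustration. Such a check speaks only to the mean; it says nothing about the second problem raised above, namely whether the task still indexes the same activity.

```python
def least_squares_slope(values):
    """Slope of the ordinary least-squares line through (1, v1), (2, v2), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two sessions")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def has_plateaued(daily_means, last_k=5, tolerance=0.5):
    """True if the slope over the last `last_k` sessions is within `tolerance`.

    Both `last_k` and `tolerance` are arbitrary illustrative defaults.
    """
    return abs(least_squares_slope(daily_means[-last_k:])) <= tolerance


# Made-up daily group means (illustration only): improving on every day.
daily_means = [40, 48, 54, 59, 63, 66, 68.5, 70.5, 72, 73.5]
slope_last5 = least_squares_slope(daily_means[-5:])
still_learning = not has_plateaued(daily_means)
```

In this made-up series the gains shrink from day to day yet never stop, so the check reports that no plateau has been reached even after ten sessions.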

Studies by Ades and Raab (1949) on the Klüver-Bucy syndrome (cited in Bachrach
and Kennedy, 1977) illustrate the latter point: animals with certain portions
of their brains removed were able to perform a visual discrimination task about
as well as unoperated animals; however, a similarly operated group was never able
to learn this task.

Moreover,  it  is  well  known  from  the  learning  literature  that,  with  extended 
practice, subjects overlearn, and when something is overlearned, it becomes
more  resistant  to  extinction.  Therefore,  for  performance  testing  in  exotic 
environments,  if  intensive  practice  is  given  on  the  tests  prior  to  their  use 
in  the  experimental  environment,  two  factors  appear  inevitable:  (1)  the  work 
is not an index of what it was at first, and (2) disruption of performance becomes
very difficult. An example of this is as follows: move the index (first)
and ring (third) fingers of the preferred hand together with the palm resting on a
flat  surface.  Then  move  the  second  and  fourth  fingers  together.  Then,  alternate 
1  and  3,  then  2  and  4,  etc.  Everyone  can  do  this  work,  but  it  requires  far  more 
concentration  for  the  average  person  than  for  a  person  who  frequently  plays  the 
piano.  The  investigators  believe  that  control  for  this  activity  is  exerted  high 
in  the  cortex  for  nonpianists,  but  has  perhaps  been  shunted  to  a  lower  center  in 
the CNS in practiced pianists. If the above is similar to what occurs in performance
testing studies, the implications are obvious.

Because of the problems listed above, the following approach is planned. We
feel  that  the  approach  is  innovative,  but  it  will  draw  heavily  on  the  research 
literature  for  the  initial  selection  of  tests  to  be  included  for  further  study. 


³Radloff, 1971, personal communication.

Those tests will be selected from the literature that meet criteria in one of
the following areas: (1) demonstrated sensitivity to either thermal, motion,
or hyperbaric environments by exhibiting degraded performances, (2) diagnostic
capability (i.e., brain-damaged individuals have been found to perform differently
from a normal population), and (3) measurement capability of a parameter
of human information processing. After initial selection of the tests, the most
promising will be subjected to further tests. The test and equipment attributes
of each test will be viewed from the standpoint of the following factors, ranked
in general order of importance: (1) reliability (e.g., test-retest, alternate
form, between and within administrations), (2) validity (e.g., predictive, content,
construct, diagnostic-concurrent, face), (3) other practical test factors
(range of capability levels covered, sensitivity, transportability, efficiency),
and (4) equipment factors (e.g., availability, equipment reliability, transformability,
safety, economy). Those tests that demonstrate a high level of adequacy on
the above criteria will comprise an experimental battery. Performances on this
battery will be compared to performances on a factor pure (e.g., Sinbad) battery
to determine uniqueness of factors. Paper and pencil tests of cognitive functions
(e.g., Bender-Gestalt, Guilford-Zimmerman) as well as well-standardized
intelligence tests (e.g., WAIS, Raven's, Stanford-Binet, Reitan, Halstead,
Wonderlic) will be administered to this same population to further delineate
and validate the factors obtained.

The  first  test  that  we  have  selected  for  further  study  is  the  so-called  Beeper 
reviewed  by  Kennedy  and  Bruns  (1975).  The  reasons  for  selecting  this  test  orig¬ 
inate  partly  from  the  literature  review  and  partly  from  the  study  of  acceleration 
stress  by  the  NAS/NRC  Committee  on  Bio-Astronautics,  who  convened  a  working  group 
headed by Robert Galambos to discuss and report on principles and problems of
performance  testing.  Using  criteria  based  largely  on  earlier  suggestions  of 
Broadbent  (1953),  a  performance  test  battery  was  proposed  that  would  have  gen¬ 
eral  and  specific  applications. 

We looked into Broadbent's report for ideas relative to the common problems of
motion  and  acceleration  stress  and  of  exotic  environments  in  general.  Recom¬ 
mendations  were  also  included  for  the  use  of  tasks  which  are:  "(a)  work  paced; 
(b)  require  vigilance;  (c)  over  a  long  period  of  time;  and  (d)  during  which 
there  is  uncertainty  in  the  stimulus  display"  (p.  22): 

1.  Laboratory  norms  on  six  different  versions  of  this  task  for  each  of  the 
approximately  100  college  graduate  males  are  available,  as  well  as  relationships 
to  personality  and  other  subject  variables  (e.g.,  hours  of  sleep)  for  these 
persons.


2.  Neurophysiological  correlates  (vestibular  nystagmus)  of  performance  were 
shown. 

3.  Practice  effects  appear  small  on  the  three-channel  auditory  version  and 
are  known  for  the  three-channel  visual  version. 

4.  The  test  can  be  group-administered. 

5.  It  is  relatively  simple  and  inexpensive  to  construct. 

6.  There  are  many  possibilities  for  constructing  alternate  forms. 

7. Task difficulty can be controlled largely by instructions.

8. Latency of response within broad limits (namely, 1-2 seconds) is generally
not a factor and so the task can appropriately be used even when environmental
variables can interact physically with response speed (e.g., underwater).

9. Stimulus recording is binary and therefore is mechanically simple. Further,
the regularity of the stimuli makes scoring relatively easy and relatively
independent of where on the magnetic tape a session begins.

10.  Proportion  measures  are  essentially  linear  (R  .95)  with  absolute  measures 
(namely,  hits)  and,  therefore,  direct  comparisons  can  be  made  over  different 
tasks. 

11.  Unlike  many  other  vigilance  tasks,  many  signals  and  responses  occur  and 
so  individual  time-line  analyses  are  possible. 

12.  The  results  suggest  that  performance  on  forms  of  this  task  may  be 
age-related. 

The  approach  we  have  utilized  includes  the  daily  administration  (15  minutes)  of 
the  Beeper  for  2  weeks  to  study  the  reliability  of  the  test  in  three  ways: 
internal  consistency,  the  accuracy  and  sensitivity  to  separate  individuals,  and 
stability  of  this  accuracy  and  sensitivity  over  repeated  testings. 
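
The first of these three, internal consistency, could be estimated in several ways, and this paper does not say which index is intended. As one possibility, the sketch below computes Cronbach's alpha after dividing each 15-minute session into three segments. The choice of alpha, the three-segment split, the function name, and the scores are all assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(part_scores):
    """Cronbach's alpha for part_scores[s][i] = score of subject s on part i.

    Uses sample variances (n - 1 denominator) throughout.
    """
    n_parts = len(part_scores[0])
    if n_parts < 2 or len(part_scores) < 2:
        raise ValueError("need at least two subjects and two parts")
    part_variances = [
        variance([row[i] for row in part_scores]) for i in range(n_parts)
    ]
    total_variance = variance([sum(row) for row in part_scores])
    return (n_parts / (n_parts - 1)) * (1.0 - sum(part_variances) / total_variance)


# Made-up hit counts for 4 subjects in three segments of one session.
segments = [
    [12, 11, 13],
    [18, 17, 19],
    [9, 10, 8],
    [15, 14, 16],
]
alpha = cronbach_alpha(segments)
```

Repeating the calculation for each of the daily sessions would show whether internal consistency itself holds up over the 2 weeks.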

We  feel  that  this  approach  will  serve  as  a  model  for  future  tasks  to  be  included 
in our battery. At this writing, data are being collected; however, the study is
not  completed.  These  results  should  be  available  at  the  meeting  in  October. 
[Tables 1-4 are not recoverable from the scan; the surviving fragments mention equipment software and hardware, Alluisi (1967), and Rose (1974).]